This is a short data analysis of Airbnb listings in New York City (NYC) in 2019. The data was taken from https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/.
Package imports:
library(tidyverse)
library(knitr)
Read in dataset:
df <- read_csv("airbnb_nyc_2019.csv",
               col_types = cols(host_id = col_character(),
                                id = col_character(),
                                last_review = col_date(format = "%Y-%m-%d")))
The dataset has 48895 rows and 16 columns. Here are the first 6 rows of the dataset:
kable(head(df))
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
3647 | THE VILLAGE OF HARLEM….NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NA | NA | 1 | 365 |
3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
5099 | Large Cozy 1 BR Apartment In Midtown East | 7322 | Chris | Manhattan | Murray Hill | 40.74767 | -73.97500 | Entire home/apt | 200 | 3 | 74 | 2019-06-22 | 0.59 | 1 | 129 |
Here are the column names of the dataset:
names(df)
## [1] "id" "name"
## [3] "host_id" "host_name"
## [5] "neighbourhood_group" "neighbourhood"
## [7] "latitude" "longitude"
## [9] "room_type" "price"
## [11] "minimum_nights" "number_of_reviews"
## [13] "last_review" "reviews_per_month"
## [15] "calculated_host_listings_count" "availability_365"
Each row in this dataset is an Airbnb listing. Sanity check: listing IDs are unique in this dataset.
length(unique(df$id))
## [1] 48895
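The same check can be written as an assertion, so that a script stops if duplicate IDs ever appear. This is a minimal sketch on a hypothetical toy data frame (`toy` is an illustration, not part of the dataset); on the real data, `df` would take its place.

```r
library(dplyr)

# Toy stand-in for df (hypothetical values, same column name).
toy <- tibble(id = c("2539", "2595", "3647"),
              price = c(149, 225, 150))

# anyDuplicated() returns 0 when no value is repeated.
n_dup <- anyDuplicated(toy$id)
stopifnot(n_dup == 0)

# n_distinct() is the dplyr equivalent of length(unique(...)).
n_ids <- n_distinct(toy$id)
```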
We note that there are some NA values in the reviews_per_month column. This is probably because those listings have zero reviews. Let’s fill them in with zeros:
# if reviews_per_month is empty, it probably means zero reviews
df$reviews_per_month <- replace_na(df$reviews_per_month, 0)
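The assumption that a missing reviews_per_month means zero reviews can be checked before filling, by comparing missingness against number_of_reviews. The sketch below uses hypothetical toy rows with the same column names; `check` and `assumption_holds` are illustrative names.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for df (hypothetical values).
toy <- tibble(number_of_reviews = c(9, 45, 0, 270, 0),
              reviews_per_month = c(0.21, 0.38, NA, 4.64, NA))

# Cross-tabulate "has zero reviews" against "reviews_per_month is missing".
check <- toy %>%
  count(no_reviews = number_of_reviews == 0,
        rpm_missing = is.na(reviews_per_month))

# The assumption holds if the two flags agree in every combination observed.
assumption_holds <- all(check$no_reviews == check$rpm_missing)

# Only then fill the missing values with zero.
toy <- toy %>%
  mutate(reviews_per_month = replace_na(reviews_per_month, 0))
```

If the flags disagree for some rows on the real data, those rows would deserve a closer look before being set to zero.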
Manhattan has the most listings, followed by Brooklyn. Most of the listings are either “Entire home/apt” or “Private room”, with a fairly even split between these two types.
ggplot(df, aes(x = fct_infreq(neighbourhood_group), fill = room_type)) +
    geom_bar() +
    labs(title = "No. of listings by borough",
         x = "Borough", y = "No. of listings") +
    theme(legend.position = "bottom")
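To put numbers on the split between room types, a table of counts and shares can accompany the bar chart. This is a sketch on hypothetical toy data (`toy` and `room_share` are illustrative names); on the real data, `df` would replace `toy`.

```r
library(dplyr)

# Toy stand-in for df (hypothetical values).
toy <- tibble(room_type = c("Entire home/apt", "Private room",
                            "Entire home/apt", "Shared room",
                            "Private room", "Entire home/apt"))

# Count listings per room type, most common first, and add each type's share.
room_share <- toy %>%
  count(room_type, sort = TRUE) %>%
  mutate(share = n / sum(n))
```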
Below is a plot of the top 10 neighborhoods by number of listings. All of them are in either Brooklyn or Manhattan.
df %>%
    group_by(neighbourhood) %>%
    # first() guarantees one row per neighborhood, even if a name appeared in two boroughs
    summarize(num_listings = n(),
              borough = first(neighbourhood_group)) %>%
    # slice_max() supersedes top_n() in dplyr 1.0.0 and later
    slice_max(num_listings, n = 10) %>%
    ggplot(aes(x = fct_reorder(neighbourhood, num_listings),
               y = num_listings, fill = borough)) +
    geom_col() +
    coord_flip() +
    theme(legend.position = "bottom") +
    labs(title = "Top 10 neighborhoods by no. of listings",
         x = "Neighborhood", y = "No. of listings")
The plot below shows the distribution of price by room type. (Note that the y-axis is on a log scale.) There is much variation in price within each room type. Overall, it looks like “Entire home/apt” listings are slightly pricier than “Private room”, which in turn are more expensive than “Shared room”. This makes intuitive sense.
ggplot(df, aes(x = room_type, y = price)) +
geom_violin() +
scale_y_log10()
In making this plot, we noticed that 11 listings have a price of zero (such values cannot be drawn on a log scale). We are not sure why this is the case, but since they are such a small fraction of the listings, we will ignore them for this analysis.
df %>% filter(price == 0) %>%
select(name, host_id, host_name, neighbourhood_group, room_type, minimum_nights)
## # A tibble: 11 x 6
## name host_id host_name neighbourhood_g… room_type minimum_nights
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 Huge Brook… 8993084 Kimberly Brooklyn Private … 4
## 2 ★Hostel St… 1316975… Anisha Bronx Private … 2
## 3 MARTIAL LO… 15787004 Martial … Brooklyn Private … 2
## 4 Sunny, Qui… 1641537 Lauren Brooklyn Private … 2
## 5 Modern apa… 10132166 Aymeric Brooklyn Entire h… 5
## 6 Spacious c… 86327101 Adeyemi Brooklyn Private … 1
## 7 Contempora… 86327101 Adeyemi Brooklyn Private … 1
## 8 Cozy yet s… 86327101 Adeyemi Brooklyn Private … 1
## 9 the best y… 13709292 Qiuchi Manhattan Entire h… 3
## 10 Coliving i… 1019705… Sergii Brooklyn Shared r… 30
## 11 Best Coliv… 1019705… Sergii Brooklyn Shared r… 30
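A numeric summary can complement the violin plot when comparing room types. The sketch below computes the median price per room type after dropping zero-price rows. It runs on hypothetical toy data (`toy` and `price_by_type` are illustrative names), so the numbers are not results from the dataset.

```r
library(dplyr)

# Toy stand-in for df (hypothetical values, including one zero price).
toy <- tibble(room_type = c("Entire home/apt", "Entire home/apt",
                            "Private room", "Private room", "Shared room"),
              price = c(225, 89, 149, 0, 40))

# Drop zero-price listings, then summarize price within each room type.
price_by_type <- toy %>%
  filter(price > 0) %>%
  group_by(room_type) %>%
  summarize(n = n(),
            median_price = median(price),
            .groups = "drop")
```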
Does the number of listings in a neighborhood affect the prices of those listings? For each neighborhood, we look at the number of listings and the median price. In the plot below, each neighborhood is represented by one point, and its color indicates the borough it belongs to.
# compute summary statistics for each neighborhood
nhd_df <- df %>%
    group_by(neighbourhood) %>%
    # first() guarantees one row per neighborhood, even if a name appeared in two boroughs
    summarize(num_listings = n(),
              median_price = median(price),
              long = median(longitude),
              lat = median(latitude),
              borough = first(neighbourhood_group))
nhd_df %>%
ggplot(aes(x = num_listings, y = median_price, col = borough)) +
geom_point(alpha = 0.5) + geom_smooth(se = FALSE) +
scale_x_log10() + scale_y_log10() +
theme_minimal() +
theme(legend.position = "bottom")
Within each borough, it looks like the number of listings in a neighborhood does not have much of an impact on the median price of its listings.
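One way to back up this visual impression is a rank correlation between the number of listings and the median price, computed separately for each borough. The sketch below uses Spearman correlation on a hypothetical toy table shaped like the per-neighborhood summary (`toy_nhd` and `cor_by_borough` are illustrative names); the values are made up and say nothing about the real data.

```r
library(dplyr)

# Toy per-neighborhood summary (hypothetical values).
toy_nhd <- tibble(
  borough = rep(c("Brooklyn", "Manhattan"), each = 4),
  num_listings = c(10, 50, 200, 1000, 20, 80, 400, 1500),
  median_price = c(60, 75, 70, 90, 150, 120, 180, 140))

# Spearman correlation is rank-based, so it does not depend on the log scaling
# used in the plot.
cor_by_borough <- toy_nhd %>%
  group_by(borough) %>%
  summarize(n_nhd = n(),
            spearman = cor(num_listings, median_price, method = "spearman"),
            .groups = "drop")
```

Values near zero would agree with the reading of the plot, while values far from zero would suggest a monotone relationship that the smoother hides.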
library(ggmap)
# get top 50 listings by price
top_df <- df %>% top_n(n = 50, wt = price)
# get background map
top_height <- max(top_df$latitude) - min(top_df$latitude)
top_width <- max(top_df$longitude) - min(top_df$longitude)
top_borders <- c(bottom = min(top_df$latitude) - 0.1 * top_height,
top = max(top_df$latitude) + 0.1 * top_height,
left = min(top_df$longitude) - 0.1 * top_width,
right = max(top_df$longitude) + 0.1 * top_width)
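# note: ggmap 4.0.0 and later replace get_stamenmap() with get_stadiamap(),
# which requires a Stadia Maps API key
top_map <- get_stamenmap(top_borders, zoom = 12, maptype = "toner-lite")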
# map of top 50 most expensive
ggmap(top_map) +
geom_point(data = top_df, mapping = aes(x = longitude, y = latitude,
col = price)) +
scale_color_gradient(low = "blue", high = "red")
Most of the 50 most expensive listings are located in Manhattan.
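The map code pads the bounding box of the points by 10% of its extent on each side. That step could be factored into a small helper so it can be reused for other maps. `padded_bbox` is a hypothetical name introduced here for illustration, and the coordinates in the usage line are made up.

```r
# Hypothetical helper: bounding box of a set of points, padded on each side
# by a fraction `pad` of the box's height and width.
padded_bbox <- function(lon, lat, pad = 0.1) {
  height <- max(lat) - min(lat)
  width <- max(lon) - min(lon)
  c(bottom = min(lat) - pad * height,
    top = max(lat) + pad * height,
    left = min(lon) - pad * width,
    right = max(lon) + pad * width)
}

# Usage on made-up coordinates.
bb <- padded_bbox(lon = c(-74.0, -73.9), lat = c(40.6, 40.8))
```

The names of the returned vector (bottom, top, left, right) match those used for the borders in the map code.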
In the map below, each dot is one neighborhood. The size of the dot depends on the number of listings and the color of the dot depends on the median price in that neighborhood.
# map of all listings: one point per neighborhood
height <- max(df$latitude) - min(df$latitude)
width <- max(df$longitude) - min(df$longitude)
borders <- c(bottom = min(df$latitude) - 0.1 * height,
top = max(df$latitude) + 0.1 * height,
left = min(df$longitude) - 0.1 * width,
right = max(df$longitude) + 0.1 * width)
map <- get_stamenmap(borders, zoom = 11, maptype = "toner-lite")
ggmap(map) +
geom_point(data = nhd_df, mapping = aes(x = long, y = lat,
col = median_price, size = num_listings)) +
scale_color_gradient(low = "blue", high = "red")
The median price for most neighborhoods is quite low; it looks somewhat elevated in Manhattan. Also, there are one or two neighborhoods with very high median prices in Staten Island: this is worth investigating further.
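A first step in that investigation could be to list the Staten Island neighborhoods by median price alongside their number of listings. One possible explanation, which is an assumption and not something checked here, is that a median based on very few listings is easily skewed by a single expensive one. The sketch uses a hypothetical toy summary table (`toy_nhd`, `outliers` and the neighborhood names are made up).

```r
library(dplyr)

# Toy per-neighborhood summary (hypothetical values).
toy_nhd <- tibble(
  neighbourhood = c("A", "B", "C", "D"),
  borough = c("Staten Island", "Staten Island", "Manhattan", "Brooklyn"),
  num_listings = c(2, 40, 1500, 900),
  median_price = c(800, 75, 150, 90))

# Staten Island neighborhoods, most expensive first, with their listing counts.
outliers <- toy_nhd %>%
  filter(borough == "Staten Island") %>%
  arrange(desc(median_price)) %>%
  select(neighbourhood, num_listings, median_price)
```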
There is much in this dataset that we have not explored yet. At first glance, it appears that room type and neighborhood are related to the listing price, while the number of listings in the neighborhood is not.